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Abstract 

Changes in population size influence genetic diversity of the population and, as a result, leave 
a signature of these changes in individual genomes in the population. We are interested in the 
inverse problem of reconstructing past population dynamics from genomic data. We start with 
a standard framework based on the coalescent, a stochastic process that generates genealogies 
connecting randomly sampled individuals from the population of interest. These genealogies 
serve as a glue between the population demographic history and genomic sequences. It turns 
out that only the times of genealogical lineage coalescences contain information about popula- 
tion size dynamics. Viewing these coalescent times as a point process, estimating population size 
trajectories is equivalent to estimating a conditional intensity of this point process. Therefore, 
our inverse problem is similar to estimating an inhomogeneous Poisson process intensity func- 
tion. We demonstrate how recent advances in Gaussian process-based nonparametric inference 
for Poisson processes can be extended to Bayesian nonparametric estimation of population size 
dynamics under the coalescent. We compare our Gaussian process (GP) approach to one of 
the state of the art Gaussian Markov random field (GMRF) methods for estimating population 
trajectories. Using simulated data, we demonstrate that our method has better accuracy and 
precision. Next, we analyze two genealogies reconstructed from real sequences of hepatitis C 
and human Influenza A viruses. In both cases, we recover more believed aspects of the viral 
demographic histories than the GMRF approach. We also find that our GP method produces 
more reasonable uncertainty estimates than the GMRF method. 



1 Introduction 

Statistical inference in population genetics increasingly relies on the coalescent (Kingman, 1982), 
the probability model that describes the relationship between a gene genealogy of a random sample 
of molecular sequences and effective population size. This model provides a good approxima- 
tion to the distribution of ancestral histories that arise from classical population genetics models 
(Rosenberg and Nordborg, 2002). More importantly coalescent-based inference methods allow us 
to estimate population genetic parameters, including population size trajectories, directly from 
genomic sequences (Griffiths and Tavare, 1994). Recent examples of coalescent-based population 
dynamics estimation include reconstructing demographic histories of musk ox (Campos et al., 2010) 
from fossil DNA samples and elucidating patterns of genetic diversity of the dengue virus (Bennett 
et al., 2010). 

Here, we are interested in estimating effective population size trajectories from gene genealo- 
gies. The effective population size is an abstract parameter that for a real biological population 
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is proportional to the rate at which genetic diversity is lost or gained. In the absence of natural 
selection, the effective population size can be used to approximate census population size by know- 
ing the generation time in calendar units (e.g. years) and the population variability in number of 
offspring (Wakeley and Sargsyan, 2009). The latter quantity might be difficult to know; however, 
sometimes it suffices to analyze an arbitrarily rescaled population size trajectory, assuming the 
variability in number of offspring remains constant. The effective population size is equal to the 
census population size in an idealized Wright-Fisher model. The Wright-Fisher model is a simple 
and established model of neutral reproduction in population genetics that assumes random mating 
and non-overlapping generations. For some RNA viruses, for example human influenza A virus, the 
effective population size rescaled by generation time (3 to 4 days) cannot be interpreted directly 
as the effective number of infections because of the presence of strong natural selection. However, 
one can always adopt a more cautious interpretation of the effective population size as a measure 
of relative genetic diversity (Rambaut et al., 2008; Frost and Volz, 2010). 

Coalescent-based methods for estimation of population size dynamics have evolved from strin- 
gent parametric assumptions, such as constant population size or exponential growth (Griffiths 
and Tavare, 1994; Kuhner et al., 1998; Drummond et al., 2002), to more flexible nonparamet- 
ric approaches that assume piecewise linear population trajectories (Strimmer and Pybus, 2001; 
Opgen-Rhein et al., 2005; Drummond et al., 2005; Heled and Drummond, 2008; Minin et al., 2008). 
The latter class of methods is more appropriate in the absence of prior knowledge about the un- 
derlying demographic dynamics, allowing researchers to infer shapes of population size trajectories 
rather than to impose parametric constraints on these shapes. These nonparametric methods, how- 
ever, model population dynamics by imposing a priori piecewise continuous functions which require 
regularization either by smoothing or by controlling the number of change points, also a priori. 
The former regularization - which works better in practice (Minin et al., 2008) - is inherently 
difficult because these piecewise continuous functions are defined on intervals of varying size. The 
piecewise nature of these methods creates further modeling problems if one wishes to incorporate 
covariates into the model or impose constraints on population size dynamics (Minin et al., 2008). 
In this paper, we propose to solve these problems by bringing the coalescent-based estimation of 
population dynamics up to speed with modern Bayesian nonparametric methods. Making this leap 
in statistical methodology will allow us to avoid artificial discretization of population trajectories, 
to perform regularization without making arbitrary scale choices, and, in the future, to extend our 
method into a multivariate setting. 

Our key insight stems from the fact that the coalescent with variable population size is an 
inhomogeneous continuous-time Markov chain (Tavare, 2004) and, therefore, can be viewed as an 
inhomogeneous point process (Andersen et al., 1995). In fact, all current Bayesian nonparametric 
methods of estimation of population size dynamics resemble early Bayesian approaches to non- 
parametric estimation of the Poisson intensity function via piecewise continuous functions (Arjas 
and Heikkinen, 1997). Estimation of the intensity function of an inhomogeneous Poisson process 
is a mature field that evolved from maximum likelihood estimation under parametric assumptions 
(Brillinger, 1979) to frequentist (Diggle, 1985) and, more recently, Bayesian nonparametric methods 
(Arjas and Heikkinen, 1997; M0ller et al., 1998; Kottas and Sanso, 2007; Adams et al., 2009). 

Following Adams et al. (2009), we a priori assume that population trajectories follow a trans- 
formed Gaussian process (GP), allowing us to model the population trajectory as a continuous 
function. This is a convenient way to specify prior beliefs without a particular functional form 
on the population trajectory. The drawback of such a flexible prior is that the likelihood func- 
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tion involves integration over an infinite-dimensional random object and, as a result, likelihood 
evaluation becomes intractable. Fortunately, we are able to avoid this intractability and perform 
inference exactly by adopting recent algorithmic developments proposed by Adams et al. (2009). 
We achieve tractability by a novel data augmentation for the coalescent process that relies on 
thinning algorithms for simulating the coalescent. 

Thinning is an accept /reject algorithm that was first proposed by Lewis and Shedler (1979) 
for the simulation of inhomogeneous Poisson processes and was later extended to a more general 
class of point processes by Ogata (1981). In the spirit of Ogata (1981), we develop novel thinning 
algorithms for the simulation of the coalescent. These algorithms, interesting in their own right, 
open the door for latent variable representation of the coalescent. This representation leads to a 
new data augmentation that is computationally tractable and amenable to standard Markov chain 
Monte Carlo (MCMC) sampling from the posterior distribution of model parameters and latent 
variables. 

We test our method on simulated data and compare its performance with a representative 
piecewise linear approach, a Gaussian Markov random field (GMRF) based method (Minin et al., 
2008). We demonstrate that our method is more accurate and more precise than the GMRF method 
in all simulation scenarios. We also apply our method to two real data sets that have been previously 
analyzed in the literature: a hepatitis C virus (HCV) genealogy estimated from sequences sampled 
in 1993 in Egypt and a genealogy of the H3N2 human influenza A virus estimated from sequences 
sampled in New York state between 2002 and 2005. In the HCV analysis, we successfully recover 
all believed key aspects of the population size trajectory. Compared to the GMRF method, our GP 
method better reflects the uncertainty inherent in the HCV data. In our second real data example, 
our GP method successfully reconstructs a population trajectory of the human influenza A virus 
with an expected seasonal series of peaks followed by population bottlenecks, while the GMRF 
method's reconstructed trajectory fails to recover some of the peaks and bottlenecks. 

2 Methods 

2.1 Coalescent Background 

The coalescent model allows us to trace the ancestry of a random sample of n genomic sequences. 
These ancestral relationships are represented by a genealogy or tree; the times at which two se- 
quences or lineages merge into a common ancestor are called coalescent times. The coalescent with 
variable population size can be viewed as a non-homogeneous Markov death process that starts 
with n lineages at present time t n = and decreases by one, with time running backwards, until 
reaching one lineage at ti, at which point the samples have been traced to their most recent com- 
mon ancestor (Griffiths and Tavare, 1994). Here, we assume that a genealogy with time measured 
in units of generations is observed. The shape of the genealogy depends on the effective population 
size trajectory, N e (t), and the number of samples accumulated through time: the larger the effective 
population size, the longer two lineages need to wait to meet a common ancestor and the larger 
the sample size, the faster two lineages coalesce. Formally, let t n = denote the present time when 
all the n available sequences are sampled {isochronous coalescent) and let t n = < i n _i <•••<<! 
denote the coalescent times of lineages in the genealogy with time going backwards. Then, the 
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Intervals: 0,2 ; 

Coalescent Times: £ 1 l 2 
Sampling Times: 
Number of samples: 

Figure 1: Example of a genealogy relating serially sampled sequences (heterochronous sampling). The 
number of lineages changes every time we move between intervals (Ii.k). Each cndpoint of an interval is a 
coalescent time ({ifc}£ =1 ) or a sampling time ({sj}JLi)- The number of sequences sampled at time Sj is 
denoted by rij. 

conditional density of the coalescent time t k _\ takes the following form: 

where C k = (2) is the coalescent factor that depends on the number of lineages k = 2, . . . , n. 

The heterochronous coalescent arises when samples of sequences are collected at different times 
(Figure 1). Such serially sampled data are common in studies of rapidly evolving viruses and 
analyses of ancient DNA (Campos et ah, 2010). Let t n = < t n ~i < ■ ■ ■ < t\ denote the coalescent 
times as before, but now let s m = < s m _i < ■ ■ ■ < si < so = ti denote sampling times 
of n m , . . . ,ni sequences respectively, X^j=i n i = n - Further, let s and n denote the vectors of 
sampling times and numbers of sequences sampled at these times, respectively (Figure 1). Now, 
the coalescent factor changes not only at the coalescent events but also at the sampling times. Let 

h,k = (max{t k ,Sj},t k -i], for sj < t k -i and k = 2,...,n, (2) 

be the intervals that end with a coalescent event and 

Ii,k = (max{t k , s j+i }, s j+i -i], for s j+i -i > t k and Sj < t k -i, k = 2, . . . , n, (3) 

be the intervals that end with a sampling event. We denote the number of lineages in Ii >k with ni jk . 
Then, for k = 2, n, 

P[tt _ lK s , n , m)] ^* + ± I H*} ■ W 



l 0,3 'o.4: b,4 '0,5 



n 1= 2 



r^=2 nj=l i\j=3 
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where = ("2*)- That is, the density for the next coalescent time t^-i is the product of the 
density of the coalescent time t^-i £ io,fc an d the probability of not having a coalescent event 
during the period of time spanned by intervals I± t k, • • • , I m ,k (Felsenstein and Rodrigo, 1999). 

2.2 Gaussian Process Prior for Population Size Trajectories 

For both isochronous or heterochronous data, we place the same prior on N e (t): 

A 



N e (t) = 
where 



1 + exp {-/(*)} 



(5) 



f(t)~gv{o,c(0)) (6) 

and GV(0, C(6)) denotes a Gaussian process with mean function and covariance function C(0) 
with hyperparameters 0. A priori, 1/N e (t) is a Sigmoidal Gaussian Process, a scaled logistic 
function of a Gaussian process whose range is restricted to lie in [0, A]; A is a positive constant 
hyperparameter, the inverse of which serves as a lower bound of N e (t) (Adams et al., 2009). 

A Gaussian process is a stochastic process such that any finite sample from the process has 
a joint Gaussian distribution. The process is completely specified by its mean and covariance 
functions (Rasmussen and Williams, 2005). For computational convenience we use Brownian motion 
as our Gaussian process prior. Generating a finite sample from a Gaussian processes requires 0(n 3 ) 
computations due to the inversion of the covariance matrix. However, when the precision matrix, 
the inverse of the covariance, is sparse, such simulations can be accomplished much faster (Rue 
and Held, 2005). For example, when we choose to work with a Brownian motion with covariance 
matrix elements C(t, t ) = ^(min(t, t )) and precision parameter 9, then the inverse of this matrix 
is tri-diagonal, which reduces the computational complexity of simulations from 0(n 3 ) to Q(n). In 
our MCMC algorithm, we need to generate realizations from the Gaussian processes at thousands 
of points, so the speed-up afforded by the Brownian motion becomes almost a necessity, prompting 
us to use this process as our prior in all our examples. 



2.3 Priors for Hyperparameters 

The precision parameter 9 controls the degree of autocorrelation of our Brownian motion prior 
and influences the "smoothness" of the reconstructed population size trajectories. We place on 9 
a Gamma prior distribution with parameters a and (3. The other hyperparameter in our model is 
the upper bound of 1/N e (t), A. When this upper bound A is unknown, the model is unidentifiable 
(see Equation (5)). However, in many circumstances it is possible to obtain an upper bound A (or 
equivalently, a lower bound on N(t)) from previous studies and use this value to define the prior 
distribution of A. We use the following strategy to construct an informative prior for A. Let A 
denote our best guess of the upper bound, possibly obtained from previous studies. Then, the prior 
on A is a mixture of a uniform distribution for values to the left of A and an exponential distribution 
to the right: 

PW = 4 J {A<A} + U " 0ie-X< A -*>J {A >* }> (7) 

where e > is a mixing proportion. When A is considerably smaller than the unknown A, the 
recovered curve will be visibly truncated around A, indicating that one needs to try higher values 
of A. 
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2.4 Doubly Intractable Posterior 

Coalescent times T = {t n , t n -i, . . . , t\} of a given genealogy contain information needed to estimate 
N e (t) (see Equations (1) and (4)). Given that N e (t) is a one-to-one function of f(t) (Equation 
(5)), we will focus the discussion on the inference of f(t). The posterior distribution of f(t) and 
hyperparameters 9 and A becomes 



P(f(t),0, A|T) oc P(T\X, f(t))P(f(t)\e)P(9)P(X), 
where P(f(t)\6) is a Gaussian process prior with hyperparameter 9 and 



P(T|A, /(*)) = III 



CfcA 



k=2 



+ exp{-/(t fe _!)} 



exp 



—Ch 



A 



1 + exp {-/(*)} 



dt 



(8) 



(9) 



is the likelihood function for the isochronous data (heterochronous data likelihood has a similar 
form). The integral in the exponent of Equation (9) and the normalizing constant of Equation (8) 
are computationally intractable, making the posterior doubly intractable (Murray et al., 2006). 

Adams et al. (2009) faced a similar doubly intractable posterior distribution in the context of 
nonparametric estimation of intensity of the inhomogeneous Poisson process. These authors propose 
an introduction of latent variables so that the augmented data likelihood becomes tractable. This 
tractability makes the posterior distribution of latent variables and model parameters amenable 
to standard MCMC algorithms. Since Adams et al. (2009) based their data augmentation on 
the thinning algorithm for simulating inhomogeneous Poisson processes, we would like to devise 
a similar data augmentation based on a thinning algorithm for simulation of the coalescent with 
variable population size. In this simulation, we envision generating coalescent times assuming a 
constant population size and then thinning these times so that the distribution of the remain- 
ing (non-rejected) coalescent times follows the coalescent with variable population size. Since no 
thinning algorithm for simulating the coalescent process exists, we develop a series of such algo- 
rithms. In developing these algorithms, we find it useful to view the coalescent as a point process, 
a representation that we discuss below. 



2.5 The Coalescent as a Point Process 

The joint density of coalescent times is obtained by multiplying the conditional densities defined 
in Equations (1) or (4). This density can be expressed as 



P{h,...,t n - 1 \N e (t)) = l[\*(t k - 1 \t k )expl- I X*(t\t k )dt\, 

k=2 Jt k 



tfc-l 



(10) 



where \*{t\tk) denotes the conditional intensity function of a point process on the real line (Daley 
and Vere- Jones, 2002). For isochronous coalescent, the conditional intensity is defined by the step 
function: 



\*{t\t k ) = {^jNS)- 1 !^^, for k = 2,..., n, 
and the conditional intensity of the heterochronous coalescent point process is: 

A*(t|n, s, t k ) = y£) Xeitr'hteh,,}! forfe = 2,...,n. 



(11) 



(12) 
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This novel representation allows us to reduce the task of estimating N e (t) to the estimation of the 
inhomogeneous intensity of the coalescent point process and to develop simulation algorithms based 
on thinning. 



2.6 Coalescent Simulation via Thinning 

To the best of our knowledge, the only method available for simulating the coalescent under the 
deterministic variable population size model is a time transformation method (Slatkin and Hudson, 
1991; Hein et al., 2005). This method is based on the random time-change theorem due to Pa- 
pangelou (1972). Under the time transformation method, to simulate coalescent times, we proceed 
sequentially starting with k = n and t n = 0, generating t from an exponential distribution with 
unit mean, solving 



for tk-i analytically or numerically and repeating the procedure until k = 2. For isochronous 
coalescent, X*(u\th) is defined in Equation (11) and for the heterochronous coalescent, X*(u\tk) = 
A*(u|n, s,ifc) is the piecewise function defined in Equation (12). When N e (t) is stochastic, the 
integral in Equation (13) becomes intractable and the time transformation method is no longer 
practical. Instead, we propose to use thinning, a rejection-based method that does not require 
calculation of the integral in Equation (13). 

Lewis and Shedler (1979) proposed thinning a homogeneous Poisson process for the simulation 
of an inhomogeneous Poisson process with intensity \(t). The idea is to start with a realization 
of points from a homogeneous Poisson process with intensity A and accept /reject each point with 
acceptance probability X(t)/X, where X(t) < X. The collection of accepted points forms a realization 
of the inhomogeneous Poisson process with conditional intensity X(t). Ogata (1981) extended 
Lewis and Shedler's thinning for the simulation of any point process that is absolutely continuous 
with respect to the standard Poisson process. We develop a series of thinning algorithms for the 
coalescent process that are similar to Ogata's algorithms, but not identical to them. Algorithm 
1 outlines the simulation of n coalescent times under the isochronous sampling. Given tk, we 
start generating and accumulating exponential random numbers E{ with rate C^A, until tk-i = 
tk + Ei + E2 + . . . is accepted with probability \/N e (tk-i)X. (see Web Appendix A for details 
and simulation algorithms of coalescent times for heterochronous sampling). In order to ensure 
convergence of the algorithm, we require J °° jfj^ = 00 a.s., which is equivalent to requiring 
that all sampled lineages can be traced back to their single common ancestor with probability 1. 
Notice that N e (t) can be either deterministic or stochastic. The latter case is considered in Web 
Supplementary Algorithm 2, where we work with f(t) instead of N e (t) for notational convenience. 

If N e (t) is deterministic and equation (13) can be solved analytically, the time transformation 
method is likely to be more efficient that thinning since the thinning algorithm is an accept/reject 
algorithm with the acceptance probability highly dependent on the definition of A. However, 
efficiency of the thinning algorithm can be improved by replacing the constant upper bound A 
on 1/N e (t), by a piece- wise constant or a piece- wise linear function of local upper bounds in order 
to achieve higher acceptance probabilities, similarly to the adaptive rejection sampling of Gilks and 




(13) 



Wild (1992). 
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Algorithm 1 Simulation of isochronous coalescent times by thinning - N e (t) is a deterministic 
function 

Input: k = n, t n = 0, t = 0, 1/N e (t) < A, N e (t) 
Output: T = {t k )l =1 
1: while k > 1 do 

2: Sample ~ Exponential (C k X) and [/ ~ [7(0, 1) 
3: t=t+E 

4: if 17 < j^jj then 
5: k <— k — 1, t k <— t 

6: end if 
7: end while 



2.7 Data Augmentation and Inference 

As mentioned in the previous section, our thinning algorithm for the coalescent is motivated by 
our desire to construct a data augmentation scheme. We imagine that observed coalescent times T 
were generated by the thinning procedure described in Algorithm 1, so we augment T with rejected 
(thinned) points N. If we keep track of the rejected points resulting from Algorithm 1, then, given 
h, f(tk), f/V* = {/(*fc,i)j"I=i and A, we start Proposing new time points M k = {t kj i, t k ,m k } until 
t k -i is accepted, so that 



P(t fc -i,JV fc |t fc ,/(t fc _i),f^ fc ,A) = (C k X) r ^ +1 ex V {-C k X(t k -t k ^)} 



n 

i=i 



Ll + exp{-/(tfc_i)} 
1 

l + exp{-/(t feji )}_ ' 



(14) 



For the heterochronous coalescent (see Algorithm 3 in Web Supplemental Materials), Equation (14) 
is modified in the following way: 

P(tk-i,ATk\tk, /(tfc-i), fjv t , A, s, n) = (AC , fc ) 1+mo - fc exp{-AC , fc /(/o, fe )} 



1 \ ™ fc 1 

1 +exp{-/(t fc _ 1 )} J J! i + exp{/(t fe 



0} 



II [(AQ, fc ) mi ' fe exp{-AC lifc /(/,, fc )}] , (15) 



i=l 



where l{I^ k ) denotes the length of the interval and m^ k = Yli=i 1 e ^z,fc} denotes the 
number of latent points in interval I i>k . Let f TjA r = {{f{t k )} k=l , {{/(ifc,*)} ^\ } ^ =2 } , then the 
augmented data likelihood of T and N becomes 



P(T, A/"|f rx , A) = nP(tfc-i,^fe|tfc,/(t fc -i),f^ fc ,A). 



(16) 



k=2 



Then, the posterior distribution of f(t) and hyperparameters evaluated at the observed T and 
latent Af time points is 



P(f TM ,X,9\T,X) cx P(T,Af\f T sr, X)P(f TM \9)P(X)P(6). 



(17) 



The augmented posterior can now be easily evaluated since it does not involve integration of infinite- 
dimensional random functions. We follow Adams et al. (2009) and develop a MCMC algorithm 
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to sample from the posterior distribution (16). At each iteration of our MCMC, we update the 
following variables: (1) number of "rejected" points #JV; (2) the locations of the rejected points A^; 
(3) the function values fjyv and (4) the hyperparameters 6 and A. We use a Metropolis-Hastings 
algorithm to sample the number and locations of latent points and the hyperparameter A; we Gibbs 
sample the hyperparameter 9. Updating the function values fjyv is nontrivial, because this high 
dimensional vector has correlated components. In such cases, single-site updating is inefficient and 
block updating is preferred (Rue and Held, 2005). We use elliptical slice sampling, proposed by 
Murray et al. (2010), to sample fr,Af- The advantage of using the elliptical slice sampling proposal is 
that it does not require the specification of tunning parameters and works well in high dimensions. 
The details of our MCMC algorithm can be found in Web Appendix B. 

We summarize the posterior distribution of N e (t) by its empirical median and 95% Bayesian 
credible intervals (BCIs) evaluated at a grid of points. This grid can be made as fine as necessary 
after the MCMC is finished. That is, given the function values fjyv at coalescent and latent time 
points, and the value of the precision parameter 9 at each iteration, we sample the function values 
at a grid of points g = {g±, gs} from its predictive distribution f g ~ P(f s \f-j-,Mj 6), an d evaluate 

{Ne{9i)}f =V 



3 Results 



3.1 Simulated Data 

We simulate three genealogies relating 100 individuals, sampled at the same time t = (isochronous 
sampling) under the following demographic scenarios: 1) constant population size trajectory: 
N e {t) = 1; 2) exponential growth: N e (t) = 25e _5i ; and 3) population expansion followed by a 
crash: N e (t) = e 4 'l{ te j ,o.5]} + e ~ 2 * +3 l{te(o.5,oo)}- We compare the posterior median with the truth 
by the sum of relative errors (SRE): 

gM = -y«>i , (18) 

i=1 ^e(Si) 

where N e (si) is the estimated trajectory at time Sj with si = t±, the time to the most recent 
common ancestor and sk = t n = for any finite K. Similarly, we compute the mean relative width 
(MRW) of the 95% BCIs defined in the following way: 

mrw = y i^w-^wi ^ (19) 

f-f KN e (si) 
i=i 

We also compute the percentage of time, the 95% BCIs cover the truth (envelope) in the following 
way: 

™, etope = ^'(%M^(-)<%i».)) (20) 

As a measure of the frequentist coverage, we calculate the percentage of times the truth is completely 
covered by the 95% BCIs (envelope = 1), by simulating each demographic scenario and performing 
Bayesian estimation of each such simulation 100 times. 
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Table 1: Summary of Simulation Results Depicted in Figure 2. SRE is the sum of relative errors as defined 
in Equation (19), MRW is the mean relative width of the 95% BCI as defined in Equation (20), envelope is 
calculated as in Equation (21) and variation is calculated as in Equation (22). 



Simulations 



SRE 



MRW Envelope Variation 



GMRF GP 



GMRF GP GMRF GP GMRF GP TRUTH 



Constant 



50.41 4.15 



4.21 0.72 100.0% 100.0% 2.27 0.08 0.00 



Exp. growth 47.65 33.60 2.55 2.35 100.0% 100.0% 30.19 52.41 24.80 
Expansion/crash 181.88 140.88 10.7 7.26 77.33% 92.0% 5.69 6.94 13.46 



We compute the three statistics for the three simulation scenarios for K = 150 at equally spaced 
time points (Table 1). These statistics do not change significantly when we use higher values of K. 
Additionally, we compute the variation of N e (t) over a regular grid of K = 150 points as follows: 



For all simulations, we set the mixing parameter e of the prior density for A (Equation (7)) 
to e = 0.01. The parameters of the Gamma prior on the GP precision parameter 9 were set to 
a = (3 = 0.001. We summarize our posterior inference in Figure 2 and compare our GP method 
to the GMRF smoothing method (Minin et al., 2008). The effective population trajectory is log 
transformed and time is measured in units of generations. 

For the constant population scenario (first row in Figure 2), the truth (dashed lines) is almost 
perfectly recovered by the GP method (solid black line) and the 95% BCIs shown as gray shaded 
areas are remarkably tight. For the exponential growth simulation (second row), the GMRF method 
recovers the truth better in the right tail, while our GP method recovers it much better in the left 
tail. The higher variation of the GP reconstruction in the right tail makes this measure higher than 
for the GMRF reconstruction. Overall, our GP method better recovers the truth in the exponential 
growth scenario, as evidenced by SREs and MRWs in Table 1. The last row in Figure 2 shows 
the results for a population that experiences expansion followed by a crash in effective population 
size. In this case, 95% BCIs of the two methods do not completely cover the true trajectory. 
While an area near the bottleneck is particularly problematic, the GP method's envelope is much 
higher (92%) than the envelope produced by the GMRF method (77.3%), the variation recovered 
by the GP method is closer to the true variation in all simulation scenarios and in general, in terms 
of the four statistics employed here, the GP method shows better performance. Results for the 
GMRF method were obtained using the BEAST software (Drummond et al., 2012) with running 
times ranging from 25 to 40 minutes, while results for the GP method were obtained using R with 
running times ranging from 60 to 180 minutes. Although our GP implementation takes longer, we 
obtain better performance in a still reasonable amount of time. 

Next, we simulate each of the three scenarios 100 times and compute the four statistics de- 
scribed before for both methods. The distributions of these statistics are represented by the boxplots 
depicted in Figure 3. In general, our GP method has smaller SREs, except in the constant case, 
where the distributions look very similar; smaller MRWs, larger envelopes and variation closer to 
the truth. Additionally, we calculate the percentage of times, the envelope is 1 as a proxy for 
frequentist coverage of the 95% BCIs. Since the 95% BCIs are calculated pointwise at 150 equally 



K-l 




(21) 



i=i 
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Figure 2: Simulated data under the constant population size (first row), exponential growth (second row) 
and expansion followed by a crash (third row). The simulated points are represented by the points at the 
bottom of each plot. We show the log of the effective population size trajectory estimated under the Gaussian 
Markov random field smoothing (GMRF) method and our method: Gaussian process-based nonparametric 
inference of effective population size (GP). We show the true trajectories as dashed lines, posterior medians 
as solid black lines and 95% BCIs by gray shaded areas. 
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Figure 3: Boxplots of SRE (top left), MRW (top right), envelope (bottom left) and variation (bottom right) 
based on 100 simulations for a constant trajectory, exponential growth and expansion followed by crash. The 
numbers above the boxplots of the bottom left plot represent the estimated frequentist coverage of the 95% 
BCIs, and the dashed lines in the bottom right plot indicate variations of the true simulated trajectories. 



spaced points, we do not necessarily expect frequentist coverage to be close to 95%. The results 
are shown as the numbers at the top of the right plot in Figure 3. The coverage levels obtained 
using the GP method are larger than those obtained using the GMRF method. 

3.2 Egyptian HCV 

Hepatitis C virus was first identified in 1989. By 1992, when HCV antibody testing became widely 
available, the prevalence of HCV in Egypt was about 10.8%. Today, Egypt is the country with 
the highest HCV prevalence (Miller and Abu-Raddad, 2010). A widely held hypothesis that can 
explain the epidemic emphasizes the role of a parenteral antischistosomal therapy (PAT) campaign, 
that started in the 1920s, combined with lack of sanitary practices. The campaign was discontinued 
in the 1970s when the intravenous treatment was gradually replaced by oral administration of the 
treatment (Ray et al., 2000). Coalescent demographic methods developed over the last 10 years 
demonstrated evidence in favor of this hypothesis (Pybus et al., 2003; Drummond et al., 2005; 
Minin et al., 2008). Therefore, this example is well suited for testing our method. We analyze the 
genealogy estimated by Minin et al. (2008), based on 63 HCV sequences sampled in Egypt in 1993, 
and compare our method to the GMRF smoothing method (Minin et al., 2008). The results are 
depicted in Figure 4, with time scaled in units of years. In line with previous results (Pybus et al., 
2003; Drummond et al., 2005; Minin et al., 2008), our estimation recovers the exponential growth 
of the HCV population size starting from the 1920s when the intravenously administered PAT was 
introduced. Both Pybus et al. (2003) and Minin et al. (2008) hypothesize that the population 
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Figure 4: Egyptian HCV. The first plot (left to right) is one possible genealogy reconstructed by Minin et al. 
(2008). The next two plots represent the log of scaled effective population trajectory estimated using the 
GMRF smoothing method and our GP method. The posterior medians for the last two plots are represented 
by solid black lines and the 95% BCI's are represented by the gray shaded areas. The vertical dashed lines 
mark the years 1920 (the start of intravenous PAT) , 1970 (the end of intravenous PAT) and 1993 (sampling 
time of sequences). 

trajectory remained constant before the start of the exponential growth. The GMRF and GP 
approaches disagree the most on the HCV population size reconstruction prior to 1920s. The 
GP method produces narrower BCIs near the root of the genealogy (1710-1770) than the GMRF 
approach. In contrast, GP BCIs are inflated in the time period from 1770 to 1900 in comparison 
to the GMRF results. We believe that the uncertainty estimates produced by the GP approach 
are more reasonable than the GMRF result, because there are multiple coalescent events during 
1710- 1770, providing information about the population size, while the time interval 1770 - 1900 has 
no coalescent events, a data pattern that should result in substantial uncertainty about the HCV 
population size. Another notable difference between the GMRF and GP methods is in estimation 
of the HCV population trajectory after 1970. The GP approach suggests a sharper decline in 
population size during this time interval. 

3.3 Seasonal Human Influenza 

Here, we estimate population dynamics of human influenza A, based on 288 H3N2 sequences sam- 
pled in New York state from January, 2001 to March, 2005. Sequences from the coding region 
of the influenza hemagglutinin (HA) gene of H3N2 influenza A virus from New York state were 
collected from the NCBI Influenza Database (Influenza Genome Sequencing Project, 2011), incor- 
porating the exact dates of viral sampling in weeks (heterochronous sampling) and aligned using 
the software package MUSCLE (Edgar, 2004). These sequences form a subset of sequences ana- 
lyzed in (Rambaut et al., 2008). We carried out a phylogenetic analysis using the software package 
BEAST (Drummond et al., 2012) to generate a majority clade support genealogy with median 
node heights as our genealogical reconstruction. The reconstructed genealogy is depicted in the 
left plot of Figure 5. Demography of H3N2 influenza A virus in temperate regions, such as New 
York, is characterized by epidemic peaks during winters followed by strong bottlenecks at the end 
of epidemic seasons. As expected, our method recovers the peaks in the effective number of infec- 
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Figure 5: H3N2 Influenza A virus in New York state. The first plot (left) is the estimated genealogy. The 
second and third plots are the GMRF and GP estimations of log scaled effective population trajectories. 
Winter seasons are represented by the doted shaded areas. Posterior medians are represented by solid black 
lines and 95% BCIs are represented by gray shaded areas. 



tions during all seasons starting from the 2001-2002 flu season (flu seasons are represented as doted 
rectangles in Figure 5). The GMRF method fails to recover the peak during the 2002-2003 season. 
The large uncertainty in population size estimation during the 1999-2000, 2000-2001, and at the 
beginning of 2005-2006 seasons is explained by the small number of coalescent events during those 
time periods, however, this uncertainty is larger in the GMRF recovered trajectory. During the 
2001-2002 flu season, the GMRF method fails to recover the expected trajectory of a peak followed 
by a bottleneck and instead, this method recovers an epidemic that started during the end of 2001, 
increased and remained "at peak" until the end of the following winter. The GMRF recovered 
trajectory during the winter season of 2003 exhibits a steep decrease. In contrast, the GP method 
detects a late peak during the 2001-2002 season, followed by a decline in the number of infections. 
There is a small bump in the effective population size of influenza in the winter of 2003, which is 
more realistic than a steady decline in the number of infections estimated by the GMRF method. 
Overall, we believe that the GP reconstructed trajectory is more feasible from an epidemiological 
point of view than the GMRF population size reconstruction. 



4 Prior Sensitivity 

In all our examples, we placed a Gamma prior on the precision parameter 9 with parameters 
a = 0.001 and (3 = 0.001. This precision parameter, unknown to us a priori, controls the smoothness 
of the GP prior. We investigate the sensitivity of our results to the Gamma prior specification using 
the Egyptian HCV data. In the first plot of Figure 6, we show the prior and posterior distributions 
of 9 under our default prior. The difference in densities suggests that prior choices do not have an 
impact on the posterior distribution. Since the mean of a Gamma distributed random variable is 
a/ /3, we investigate the sensitivity by fixing (3 = .001 and setting the value of a to 0.001, 0.002, 
0.005, 0.01 and 0.1, corresponding to prior means 1, 2, 5, 10 and 100 and variances 1000, 2000, 
5000, 10000 and 100000, and by trying two extremes: a = 1, f3 = .0001 and a = .001, j3 = 1, to 
examine the posterior distribution of 9 under these priors. The posterior sample boxplots displayed 
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in Figure 6 demonstrate that our results are fairly robust to different choices of a. 

5 Sensitivity to the Order of the Gaussian Process 

We evaluate our GP-based method for three different Gaussian Process priors for the Egyptian HCV 
genealogy. In Figure 7, we show the recovered trajectories for Brownian Motion (BM), Ornstein- 
Uhlenbeck (OU) and approximated Integrated Brownian motion (IBM) (Lindgren and Rue, 2008). 
The common characteristic of these three priors is the sparsity of their precision matrices (inverse 
covariance matrix), allowing for computational tractability. Figure 7 shows that the order of the 
process does make a difference, but only in regions with large posterior uncertainty, where prior 
influence is more pronounced. 

6 Discussion 

We propose a new nonparametric method for estimating population size dynamics from gene ge- 
nealogies. To the best of our knowledge, we are the first to solve this inferential problem using 
modern tools from Bayesian nonparametrics. In our approach, we assume that the population size 
trajectory a priori follows a transformed Gaussian process. This flexible prior allows us to model 
population size trajectory as a continuous function without specifying its parametric form and 
without resorting to artificial discretization methods. We tested our method on simulated and real 
data and compared it with the competing GMRF method. On simulated data, our method recov- 
ers the truth with better accuracy and precision. On real data, where true population trajectories 
are unknown, our method recovers known epidemiological aspects of the population dynamics and 
produces more reasonable estimates of uncertainty than the competing GMRF method. 

We bring Bayesian nonparametrics into the coalescent framework by viewing the coalescent as 
a point process. This representation allows us to adapt Bayesian nonparametric methods originally 
developed for Poisson processes to the coalescent modeling. In particular, it allows us to adapt 
the thinning-based data augmentation for Poisson processes developed by Adams et al. (2009). We 
devise an analogous data augmentation for the coalescent by developing a series of new thinning 
algorithms for the coalescent. Although we use these algorithms in a very narrow context, our novel 
coalescent simulation protocols should be of interest to a wide range of users of the coalescent. For 
example, we are not aware of any competitors of our Web Supplementary Algorithms 2 and 4 that 
allow one to simulate coalescent times with a continuously and stochastically varying population 
size. 

Our method works with any Gaussian process with mean and covariance matrix C, where the 
latter controls the level of smoothness and autocorrelation. For computational tractability however, 
sparsity in the precision matrix (inverse covariance matrix) may be necessary for complex trajec- 
tories with a high number of latent points. One way to achieve sparse matrix computations and 
computational tractability is to use GP that is also Markov. In all our examples, we use Brownian 
motion with precision parameter 9; however, the nondifferentiability characteristic of the Brown- 
ian motion is compensated by the fact that our estimate of effective population trajectory is the 
posterior median evaluated pointwise, which is smoother than any of the sampled posterior curves. 
Additionally, we compared Brownian motion, Ornstein-Uhlenbeck and a higher order integrated 
Brownian motion for one of our examples and obtained very similar results under all three priors 
(see Web Supplementary Materials). Finite sample distributions under these three priors enjoy 
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sparse precision matrices that yield computational tractability comparable to the GMRF method. 
In our Brownian motion prior, the precision parameter controls the level of smoothness of the es- 
timated population size trajectory. We find that this important parameter shows little sensitivity 
to prior perturbations. 

Our method assumes that a genealogy or tree is given to the researcher. However, genealogies are 
themselves inferred from molecular sequences, so we need to incorporate genealogical uncertainty 
into our estimation. Our framework can be extended to inference from molecular sequences instead 
of genealogies by introducing another level of hierarchical modeling into our Bayesian framework, 
similar to the work of Drummond et al. (2005) and Minin et al. (2008). Further, we plan to extend 
our method to handle molecular sequence data from multiple loci as in (Heled and Drummond, 
2008). Finally, we would like to extend our nonparametric estimation into a multivariate setting, 
so that we can estimate cross correlations between population size trajectories and external time 
series. Estimating such correlations is a critical problem in molecular epidemiology. 

We deliberately adapted the work of Adams et al. (2009) on estimating the intensity function of 
an inhomogeneous Poisson process, as opposed to alternative ways to attack this estimation problem 
(M0ller et al., 1998; Kottas and Sanso, 2007), to the coalescent. We believe that among the state- 
of-the-art Bayesian nonparametric methods, our adopted GP-based framework is the most suitable 
for developing the aforementioned extensions. First, it is straightforward to incorporate external 
time series data into our method by replacing a univariate GP prior with a multivariate process 
that evolves the population size trajectory and another variable of interest in a correlated fashion 
(Teh et al., 2005). Second, the fact that our method does not operate on a fixed grid is critical for 
relaxing the assumption of a fixed genealogy, because fixing the grid a priori is problematic when 
one starts sampling genealogies, including coalescent times, during MCMC iterations. 

Finally, since the coalescent model with varying population size can be viewed as a particular 
example of an inhomogeneous continuous-time Markov chain, all our mathematical and compu- 
tational developments are almost directly transferable to this larger class of models. Therefore, 
our developments potentially have implications for nonparametric estimation of inhomogeneous 
continuous-time Markov chains with numerous applications. 
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A Coalescent Simulation Algorithms 



Proposition 1. Algorithm 1 generates t n < t n -i < ■ ■ ■ <t\, such that 



P(t fe _i > t\t k ) = exp - 




N e (x) 



Ckdx 



(A.l) 



where N e (t) is known deterministically. 

Proof. Let Tj = tk + E\ + . . . + Ei, where {Ei} c £L l are iid exponential Exp(CkX) random numbers. 
Given tk, Algorithm 1 generates and accumulates iid exponential random numbers until Tj is 
accepted with probability l/XN e (Ti), in which case, Tj is labeled tk-i- Let N(tk,t\ = #{i > 1 : 
tk < Ti < t} denote the number of iid exponential random numbers generated in (tk,t\. Then, 
{N(tk,t],t > tk} constitutes a Poisson process with intensity CkX. Then, given N(tk,t] = 1, the 
conditional density of a point x in (tk, t] is l/(t — tk) and the probability of accepting such a point 
as a coalescent time point with variable population size is l/XN e (x). Hence 




(A.2) 



and 




(A.3) 



Then, 



P(tk-i > t\t k ) = Y, > t\tk,N(t k ,t] = m)P(N(t k ,t] = m) 






□ 



Algorithm 3 and 4 are analogous heterochronous versions of Algorithm 1 and 2. 
An R implementation of these algorithms is available at 
http: //www. stat .Washington. edu/people/jpalacio . 
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Algorithm 2 Simulation of isochronous coalescent times by thinning with f(t) ~ GV(0, C(0)) 



Input: k = n, t n = 0, t = 0, i,- = 0, rrij = 0, j = 2, . . . , n, A 
Output: T = {t k } n k=1 , M = {{t Kl }T = \T k=2 , i TM 
1: while fc > 1 do 

2: Sample E ~ Exponential (C k X) and U ~ U(Q, 1) 
3: t=t+E 

4: Sample /(t) ~ P(/(t)|{/(t,)}P =fc , {{/fc)}™'iKU; 

5: if ^ l+cxp(-/(t)) then 
6: k 4- k - 1, tfc «- t 

7: else 

8: i k 4- i k + 1, m k 4- m k + 1, <- * 

9: end if 

10: end while 



Algorithm 3 Simulation of heterochronous coalescent by thinning - JV e (t) is a deterministic func- 
tion^ 

Input: ni,n 2 , . . . ,n m , si, ...,s m , 1/N e (t) < A, N e (i), m 
Output: for n = J2] l =1 n 3 ,T = {t k } n k=l 
1 

2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 



i = 1, j = n — 1, n = n±, t = t n = si 
while i < m + 1 do 

Sample E ~ £7xp((£)A) and [/ ~ U(0, 1) 

if ^ JVe(ti-£)A then 
\f t + E < Si+i then 
; t 4- t + £ 
j 4— j — 1, n ■<— n — 1 
if n > 1 then 

go to 2 
else 

go to 14 
end if 
else 

i 4— i + 1, t 4— Si, n 4— n + rii 
end if 
else 

t 4- t + E 
end if 
end while 
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Algorithm 4 Simulation of heterochronous coalescent by thinning with f(t) ~ GV(0, C(8)) 

Input: ni,n 2 , n m , si = 0,...,s m , ij = 0, mj = 0, j = 2, . . . , n, A, m 
Output: for n = Zf=i nj,T = {t k } n k=1 , N = {{t M }£\}JU, t TM 

1: i = 1, j = n — 1, n = n±, t = t n = si 

2: while i < m + 1 do 

3: Sample E~ Exp{(%) A) and U~ 17(0, 1) 

4: Sample /(t + £7) ~ P(/(* + £7)|{/(t,)}? =fc , {{/(*m)}£'i}?=*; 0) 

5: if ^ i+exp(-/(*+s)) then 

6: if t + E < Si+i then 

7: tj<-t<-t + E 

8: j ■<— j — 1, n <— n — 1 

9: if n > 1 then 

10: go to 2 

11: else 

12: go to 14 

13: end if 

14: else 

15: i •<— i + t 4— Si, n <— n + rii 

16: end if 

17: else 

18: if t + £ < Sj + i then 

19: tj+i,i j+ i <~t + E, ij+i <- + 1 

20: end if 

21: t<-t + E 

22: end if 
23: end while 
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Figure 6: Prior sensitivity on the GP precision parameter. Left plot shows the prior and posterior dis- 
tributions represented by dashed line and vertical bars respectively. Right plot shows the boxplots of the 
posterior distributions of the precision parameter when the prior distributions differ in mean and variance 
of the precision parameter 8. These plots are based on the Egyptian HCV data. 
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Figure 7: Egyptian HCV recoverd by placing three different Gaussian process priors. The first plot (left to 
right) corresponds to a Brownian motion (BM), the second - to Ornstein-Uhlenbeck (OU) and the last one 
- to the approximated integrated Brownian motion (IBM). 
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B MCMC Sampling 



Since the coalescent under isochronous sampling is a particular case of the coalescent model under 
heterochronous sampling, we employ here the notation of the heterochronous coalescent, under- 
standing that Co,fc = Cfej Io,k = (ife; t k -i] an d i = for isochronous data. 

Sampling the number of latent points. A reversible jump algorithm is constructed for the 
number of "rejected" points. We propose to add or remove points with equal probability in each 
interval defined by Equations (3) and (4). When adding a point in a particular interval, we propose 
a location uniformly from the interval and its predicted function value f(t*) ~ P(f(t*)\{f,j\f, 9). 
When removing a point, we propose to remove a point selected uniformly from the pool of rejected 
points in that particular interval. We add points with proposal distributions q % up and remove points 
with proposal distributions Qj'^ . Then, 

LLC/ LU I L 

, k _ P(f(t*)\T,M,9) 

Qup ~ ma — ' ( } 



qf = -— , (B.2) 



and the acceptance probabilities are: 



i,k = l(Ij,k)^C it k m „, 

*"» ( m ., + l)(l + e /(**))' ^ 6 > 

i,k _ ^,fc(l + e /(t,) ) 

Sampling locations of latent points. We use a Metropolis-Hastings algorithm to update the 
locations of latent points. We first choose an interval defined by Equations (3) and (4) with 
probability proportional to its length and we then propose point locations uniformly at random in 
that interval together with their predictive function values ft* ~ P(ft* I^TX' Since the proposal 
distributions are symmetric, the acceptance probabilities are: 

ik l + e /W 

° ' = TT^) • (R5) 



Sampling transformed effective population size values. We use an elliptical slice sampling 
proposal described in (Murray et al., 2010). In both cases, isochronous or heterochronous, the full 
conditional distribution of the function values fjyv is proportional to the product of a Gaussian 
density and the thinning acceptance and rejection probabilities: 

P(f TX \T,X, A, 9) oc P(f T ^\9)L(f TM ), (B.6) 

where 

^rx)=n( 1+e - J (,^ )n rT7i M- < bj) 
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Sampling hyperparameters. The full conditional of the precision parameter 9 is a Gamma 
distribution. Therefore, we update 9 by drawing from its full conditional: 



9\f TM , T,M- Gamma ( a* = a + O^Jl, f = f, + f T^ f ^ 



r 



(B. 



where Q = \C~ X . 

For the upper bound A on N e (t)~ x , we use the Metropolis-Hastings update by proposing new values 
using a uniform proposal reflected at 0. That is, we propose A* from U(X — a, A + a). If the proposed 
value A* is negative, we flip its sign. Since the proposal distribution is symmetric, the acceptance 
probability is: 

P(X*) /\*\#{ AfuT } 



P(A) VA 



where -P(A) is defined in Equation (7). 



cxp 



(A*-A)^^C^Z(^) 



k=2 i=l 



(B.9) 
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